Automatic Segmentation and Summarization of Spoken Lectures
Author
Abstract
The ever-increasing number of online lectures has created an unprecedented opportunity for distance learning. Most online lectures are presented as unstructured text, audio, and/or video files, which makes it difficult for students to locate relevant lectures and browse through them. In this thesis, we investigated several automatic lecture segmentation and summarization algorithms. Automatic lecture segmentation algorithms impose structure by inserting paragraph separators and section titles into lectures to make them more readable. The segmentation algorithms include K-Means, HMM, and Hierarchical Segmentation. Unlike segmentation, automatic lecture summarization algorithms compress lectures by selecting salient information based on the importance of words and phrases, providing concise, coherent, and readable summaries. The summarization algorithms were developed based on two supervised machine learning approaches: Support Vector Machine (SVM) and Conditional Random Fields (CRF). These approaches show results comparable to those of existing summarization methods. We also propose a novel Rhetorical Structure Index (RSI) to measure the structural importance of a feature. Experiments show that using RSI significantly improves segmentation accuracy compared to traditional content-based feature weighting schemes such as TF/IDF. We also build summarization systems with structure-dependent models, such that the summary generated for the “Introduction” is different from those for the “Main Lecture Content” and “Conclusion”. Of the three segmentation algorithms, HMM showed the best performance. However, like K-Means, HMM requires a fixed number of segments, which is not the case for Hierarchical Segmentation. Experiments show that structure-dependent summarization outperforms the uniform summarization method by 15%.
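For readers unfamiliar with the content-based baseline named in the abstract, the following is a minimal sketch of TF/IDF weighting applied to sentences, followed by a simple ranking of sentences by total weight. It is an illustration only: the function names, the whitespace tokenizer, the choice to treat each sentence as a "document", and the sum-based sentence score are all assumptions made here, not the thesis's implementation. RSI is not shown because the abstract does not define it.

```python
import math
from collections import Counter


def tfidf_weights(sentences):
    """Return one {term: weight} dict per sentence, with weight = tf * log(N / df).

    Assumption: each sentence is treated as a separate document and
    tokens are obtained by lowercasing and splitting on whitespace.
    """
    tokenized = [s.lower().split() for s in sentences]
    n = len(tokenized)
    # Document frequency: number of sentences that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append(
            {t: (c / len(tokens)) * math.log(n / df[t]) for t, c in tf.items()}
        )
    return weights


def rank_sentences(sentences):
    """Return sentence indices ordered by total TF/IDF weight, highest first.

    Assumption: the sentence score is the plain sum of its term weights.
    """
    scores = [sum(w.values()) for w in tfidf_weights(sentences)]
    return sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)


if __name__ == "__main__":
    lecture = [
        "the lecture covers hidden markov models",
        "the lecture covers the lecture",
        "the lecture ends",
    ]
    print(rank_sentences(lecture))
```

Terms that occur in every sentence (here "the" and "lecture") receive zero weight, so the ranking is driven by the rarer terms. A purely content-based score of this kind ignores where a sentence sits in the lecture, which is the gap the abstract says a structural measure is meant to address.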
Similar resources
A browsing system for classroom lecture speech
Developing technologies to summarize and retrieve huge quantities of spoken documents, recorded during classroom lectures, for the purpose of e-Learning or self-learning is important. In this paper, we describe an adaptation method of a language model to recognize keywords in given slides. Next, we propose a summarization method for spoken classroom lectures using prosodic features and linguis...
A survey on Automatic Text Summarization
Text summarization endeavors to produce a summary version of a text, while maintaining the original ideas. The textual content on the web, in particular, is growing at an exponential rate. The ability to sift through such a massive amount of data, in order to extract the useful information, is a major undertaking and requires an automatic mechanism to aid with the extant repository of informa...
Intonational phrases for speech summarization
Extractive speech summarization approaches select relevant segments of spoken documents and concatenate them to generate a summary. The extraction unit chosen, whether a sentence, syntactic constituent, or other segment, has a significant impact on the overall quality and fluency of the summary. Even though sentences tend to be the choice of most of the extractive speech summarizers, in this paper...
Automatic extraction of cue phrases for important sentences in lecture speech and automatic lecture speech summarization
We automatically extract the summaries of spoken class lectures. This paper presents a novel method for sentence extraction-based automatic speech summarization. We propose a technique that extracts “cue phrases for important sentences (CPs)” that often appear in important sentences. We formulate CP extraction as a labeling problem of word sequences and use Conditional Random Fields (CRF) [1] f...
Mandarin Chinese Broadcast News Retrieval and Summarization Using Probabilistic Generative Models
This paper presents our recent research work on applying probabilistic generative models to Mandarin Chinese broadcast news retrieval and summarization. Most models can be trained in either a supervised or unsupervised manner. In addition, both literal term matching and concept matching strategies have been intensively investigated. This paper also presents a prototype web-based Mandarin Chines...